PPO1 (PPO-Clip) — low-level PyTorch implementation#
Goal: implement the classic clipped surrogate objective version of Proximal Policy Optimization (often referred to as PPO1 in older codebases) using plain PyTorch (no RL libraries), and visualize:
policy probability ratios \(r_t\) and clipping behavior (Plotly)
learning curves and reward per episode (Plotly)
This notebook is designed to be offline-friendly and runs on CartPole-v1 (Gymnasium).
Notebook roadmap#
PPO1 objective: intuition + the clipped surrogate (LaTeX)
A minimal PyTorch actor-critic
Rollout collection + GAE(\(\gamma, \lambda\))
PPO clipped update (multiple epochs + minibatches)
Plotly visualizations: ratios, clipping, reward per episode
Stable-Baselines PPO1 reference implementation (web research)
Hyperparameters (what they do + tuning tips)
Prerequisites#
Python + PyTorch
Gymnasium (gymnasium)
Plotly
Everything is self-contained (no downloads).
import math
import random
from dataclasses import dataclass
import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
import torch
import torch.nn as nn
import torch.nn.functional as F
import gymnasium as gym
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
print("torch:", torch.__version__)
print("gymnasium:", gym.__version__)
print("plotly:", plotly.__version__)
torch: 2.7.0+cu126
gymnasium: 1.1.1
plotly: 6.5.2
# --- Reproducibility ---
SEED = 7
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
# --- Run configuration ---
FAST_RUN = True # set False for longer training
ENV_ID = "CartPole-v1"
ROLLOUT_STEPS = 512 if FAST_RUN else 2048
N_UPDATES = 40 if FAST_RUN else 200
TOTAL_TIMESTEPS = N_UPDATES * ROLLOUT_STEPS
UPDATE_EPOCHS = 4
MINIBATCH_SIZE = 128
GAMMA = 0.99
GAE_LAMBDA = 0.95
CLIP_EPS = 0.2
LEARNING_RATE = 3e-4
ADAM_EPS = 1e-5
ENT_COEF = 0.0
VF_COEF = 0.5
MAX_GRAD_NORM = 0.5
# Extra logging
LOG_EVERY_UPDATES = 1
# Device (suppress noisy CUDA init warnings in restricted environments)
import warnings
warnings.filterwarnings('ignore', message='CUDA initialization:.*')
cuda_ok = bool(torch.cuda.is_available())
device = torch.device("cuda" if cuda_ok else "cpu")
print("device:", device)
print("updates:", N_UPDATES)
print("total_timesteps:", TOTAL_TIMESTEPS)
device: cpu
updates: 40
total_timesteps: 20480
1) PPO1 / PPO-Clip objective (clipped surrogate)#
PPO maintains a current policy \(\pi_{\theta}\) and a behavior (old) policy \(\pi_{\theta_{\mathrm{old}}}\) that generated a batch of data.
Define the probability ratio:
\[ r_t(\theta) = \frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)} \]
Let \(A_t\) be an advantage estimate (commonly GAE). The clipped surrogate objective is:
\[ L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\Big( r_t(\theta)\,A_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,A_t \Big)\right] \]
Intuition:
If \(A_t > 0\): we want \(\pi_\theta\) to increase the probability of \(a_t\), but the objective stops rewarding increases once \(r_t\) exceeds \(1+\epsilon\).
If \(A_t < 0\): we want \(\pi_\theta\) to decrease the probability of \(a_t\), but the objective stops rewarding decreases once \(r_t\) falls below \(1-\epsilon\).
In code we typically minimize the negative objective: policy_loss = -mean(min(...)).
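As a quick illustration (the numbers are made up and this cell is not part of the training code), the same computation on a handful of transitions:
old_logp = torch.tensor([-0.69, -0.69, -0.69, -0.69])  # log-probs under the old policy (illustrative)
new_logp = torch.tensor([-0.40, -0.95, -0.69, -1.20])  # log-probs under the current policy (illustrative)
adv = torch.tensor([1.0, -0.5, 2.0, -1.0])             # advantages (illustrative)
ratio = (new_logp - old_logp).exp()                    # r_t
unclipped = ratio * adv
clipped = ratio.clamp(1 - CLIP_EPS, 1 + CLIP_EPS) * adv
policy_loss = -torch.minimum(unclipped, clipped).mean()  # negative clipped surrogate
print("ratios:", ratio)          # the first ratio (~1.34) exceeds 1 + CLIP_EPS and gets clipped
print("policy_loss:", float(policy_loss))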
2) Environment#
We use CartPole-v1 (discrete actions, low-dimensional state). PPO also works for continuous actions; the PPO1 clipping logic is the same.
env = gym.make(ENV_ID)
env.action_space.seed(SEED)
obs_dim = int(np.prod(env.observation_space.shape))
assert isinstance(env.action_space, gym.spaces.Discrete)
action_dim = env.action_space.n
print("obs_dim:", obs_dim)
print("action_dim:", action_dim)
obs_dim: 4
action_dim: 2
3) Low-level PyTorch actor-critic#
We implement:
actor: outputs logits for a categorical action distribution
critic: outputs state-value \(V(s)\)
No helper RL libraries; just torch.
class ActorCritic(nn.Module):
def __init__(self, obs_dim: int, action_dim: int, hidden: int = 64):
super().__init__()
self.backbone = nn.Sequential(
nn.Linear(obs_dim, hidden),
nn.Tanh(),
nn.Linear(hidden, hidden),
nn.Tanh(),
)
self.policy_head = nn.Linear(hidden, action_dim)
self.value_head = nn.Linear(hidden, 1)
# Orthogonal init is common for PPO
for m in self.modules():
if isinstance(m, nn.Linear):
nn.init.orthogonal_(m.weight, gain=math.sqrt(2))
nn.init.constant_(m.bias, 0.0)
nn.init.orthogonal_(self.policy_head.weight, gain=0.01)
def forward(self, obs: torch.Tensor):
x = self.backbone(obs)
logits = self.policy_head(x)
value = self.value_head(x).squeeze(-1)
return logits, value
@torch.no_grad()
def act(self, obs: torch.Tensor):
logits, value = self.forward(obs)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()
log_prob = dist.log_prob(action)
entropy = dist.entropy()
return action, log_prob, entropy, value
def evaluate_actions(self, obs: torch.Tensor, actions: torch.Tensor):
logits, value = self.forward(obs)
dist = torch.distributions.Categorical(logits=logits)
log_prob = dist.log_prob(actions)
entropy = dist.entropy()
return log_prob, entropy, value
agent = ActorCritic(obs_dim, action_dim).to(device)
optimizer = torch.optim.Adam(agent.parameters(), lr=LEARNING_RATE, eps=ADAM_EPS)
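A quick smoke test (not part of the original training path; the all-zero observation is just a placeholder):
dummy_obs = torch.zeros(1, obs_dim, device=device)  # batch of one placeholder state
action, logp, entropy, value = agent.act(dummy_obs)
print("sampled action:", int(action.item()), "| log_prob:", float(logp.item()), "| V(s):", float(value.item()))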
4) Rollouts + GAE#
We collect an on-policy rollout of length ROLLOUT_STEPS, then compute:
advantages \(A_t\) via Generalized Advantage Estimation (GAE)
returns \(R_t = A_t + V(s_t)\)
Finally we do multiple epochs of minibatch optimization on the same rollout.
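For reference, the recursion that compute_gae below implements is
\[ \delta_t = r_t + \gamma\,(1-d_t)\,V(s_{t+1}) - V(s_t), \qquad \hat{A}_t = \delta_t + \gamma\lambda\,(1-d_t)\,\hat{A}_{t+1}, \qquad \hat{R}_t = \hat{A}_t + V(s_t) \]
where \(d_t = 1\) if the episode ended at step \(t\), and the value of the last observation is used to bootstrap the final step of the rollout.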
@dataclass
class Rollout:
obs: torch.Tensor
actions: torch.Tensor
log_probs: torch.Tensor
values: torch.Tensor
rewards: torch.Tensor
dones: torch.Tensor
advantages: torch.Tensor
returns: torch.Tensor
def compute_gae(
rewards: np.ndarray,
values: np.ndarray,
dones: np.ndarray,
last_value: float,
*,
gamma: float,
lam: float,
):
"""GAE for a single-environment rollout."""
T = len(rewards)
adv = np.zeros(T, dtype=np.float32)
gae = 0.0
for t in reversed(range(T)):
next_nonterminal = 1.0 - float(dones[t])
next_value = last_value if t == T - 1 else values[t + 1]
delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
gae = delta + gamma * lam * next_nonterminal * gae
adv[t] = gae
ret = adv + values
return adv, ret
def collect_rollout(env, agent: ActorCritic, rollout_steps: int, obs: np.ndarray):
obs_list = []
action_list = []
logp_list = []
value_list = []
reward_list = []
done_list = []
episode_returns = []
ep_return = 0.0
for _ in range(rollout_steps):
obs_tensor = torch.tensor(obs, dtype=torch.float32, device=device).unsqueeze(0)
action, logp, entropy, value = agent.act(obs_tensor)
action_item = int(action.item())
next_obs, reward, terminated, truncated, _ = env.step(action_item)
done = bool(terminated or truncated)
obs_list.append(obs)
action_list.append(action_item)
logp_list.append(float(logp.item()))
value_list.append(float(value.item()))
reward_list.append(float(reward))
done_list.append(done)
ep_return += float(reward)
obs = next_obs
if done:
episode_returns.append(ep_return)
ep_return = 0.0
obs, _ = env.reset()
# Bootstrap value at the end of rollout
with torch.no_grad():
obs_tensor = torch.tensor(obs, dtype=torch.float32, device=device).unsqueeze(0)
_, last_value = agent.forward(obs_tensor)
last_value = float(last_value.item())
obs_arr = np.asarray(obs_list, dtype=np.float32)
actions_arr = np.asarray(action_list, dtype=np.int64)
logp_arr = np.asarray(logp_list, dtype=np.float32)
values_arr = np.asarray(value_list, dtype=np.float32)
rewards_arr = np.asarray(reward_list, dtype=np.float32)
dones_arr = np.asarray(done_list, dtype=np.bool_)
adv_arr, ret_arr = compute_gae(
rewards_arr,
values_arr,
dones_arr,
last_value,
gamma=GAMMA,
lam=GAE_LAMBDA,
)
# Advantage normalization is a common PPO trick
adv_arr = (adv_arr - adv_arr.mean()) / (adv_arr.std() + 1e-8)
rollout = Rollout(
obs=torch.tensor(obs_arr, dtype=torch.float32, device=device),
actions=torch.tensor(actions_arr, dtype=torch.int64, device=device),
log_probs=torch.tensor(logp_arr, dtype=torch.float32, device=device),
values=torch.tensor(values_arr, dtype=torch.float32, device=device),
rewards=torch.tensor(rewards_arr, dtype=torch.float32, device=device),
dones=torch.tensor(dones_arr.astype(np.float32), dtype=torch.float32, device=device),
advantages=torch.tensor(adv_arr, dtype=torch.float32, device=device),
returns=torch.tensor(ret_arr, dtype=torch.float32, device=device),
)
return rollout, episode_returns, obs
5) PPO1 update step#
For each rollout batch we optimize the clipped surrogate objective over several epochs and minibatches.
We also log ratio statistics so we can visualize clipping. Note that approx_kl uses the simple estimator \(\mathbb{E}[\log\pi_{\theta_{\mathrm{old}}} - \log\pi_{\theta}]\), so individual estimates can come out slightly negative (as seen in the training log later).
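The total loss minimized in each minibatch combines the three terms, matching VF_COEF and ENT_COEF from the configuration above:
\[ L(\theta) = -L^{\mathrm{CLIP}}(\theta) + c_{v}\,\big(V_\theta(s_t) - \hat{R}_t\big)^2 - c_{e}\,\mathcal{H}\big[\pi_\theta(\cdot\mid s_t)\big] \]
averaged over the minibatch, with \(c_v\) = VF_COEF and \(c_e\) = ENT_COEF.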
def ppo_update(agent: ActorCritic, optimizer: torch.optim.Optimizer, rollout: Rollout):
batch_size = rollout.obs.shape[0]
b_inds = np.arange(batch_size)
policy_losses = []
value_losses = []
entropies = []
clip_fracs = []
approx_kls = []
for _ in range(UPDATE_EPOCHS):
np.random.shuffle(b_inds)
for start in range(0, batch_size, MINIBATCH_SIZE):
end = start + MINIBATCH_SIZE
mb_inds = b_inds[start:end]
obs_b = rollout.obs[mb_inds]
actions_b = rollout.actions[mb_inds]
old_logp_b = rollout.log_probs[mb_inds]
adv_b = rollout.advantages[mb_inds]
ret_b = rollout.returns[mb_inds]
new_logp, entropy, value = agent.evaluate_actions(obs_b, actions_b)
log_ratio = new_logp - old_logp_b
ratio = log_ratio.exp()
# PPO clipped surrogate
unclipped = ratio * adv_b
clipped = ratio.clamp(1.0 - CLIP_EPS, 1.0 + CLIP_EPS) * adv_b
policy_loss = -torch.mean(torch.minimum(unclipped, clipped))
value_loss = F.mse_loss(value, ret_b)
entropy_mean = torch.mean(entropy)
loss = policy_loss + VF_COEF * value_loss - ENT_COEF * entropy_mean
optimizer.zero_grad(set_to_none=True)
loss.backward()
nn.utils.clip_grad_norm_(agent.parameters(), MAX_GRAD_NORM)
optimizer.step()
# Diagnostics
with torch.no_grad():
approx_kl = torch.mean(old_logp_b - new_logp).item()
clip_frac = torch.mean((torch.abs(ratio - 1.0) > CLIP_EPS).float()).item()
policy_losses.append(policy_loss.item())
value_losses.append(value_loss.item())
entropies.append(entropy_mean.item())
clip_fracs.append(clip_frac)
approx_kls.append(approx_kl)
return {
"policy_loss": float(np.mean(policy_losses)),
"value_loss": float(np.mean(value_losses)),
"entropy": float(np.mean(entropies)),
"clip_frac": float(np.mean(clip_fracs)),
"approx_kl": float(np.mean(approx_kls)),
}
6) Train PPO1 on CartPole#
We train for TOTAL_TIMESTEPS and record:
reward per episode
PPO diagnostics (losses, clip fraction, KL)
a final batch of ratios/advantages for plotting
episode_rewards = []
logs = []
last_ratio_snapshot = None
last_adv_snapshot = None
last_clip_active_snapshot = None
obs, _ = env.reset(seed=SEED)
for update in range(1, N_UPDATES + 1):
rollout, ep_returns, obs = collect_rollout(env, agent, ROLLOUT_STEPS, obs)
episode_rewards.extend(ep_returns)
metrics = ppo_update(agent, optimizer, rollout)
# Capture ratio/adv snapshots (after the update) for visualization
with torch.no_grad():
new_logp, _, _ = agent.evaluate_actions(rollout.obs, rollout.actions)
ratio = (new_logp - rollout.log_probs).exp()
adv = rollout.advantages
clip_active = ((adv >= 0) & (ratio > 1.0 + CLIP_EPS)) | (
(adv < 0) & (ratio < 1.0 - CLIP_EPS)
)
last_ratio_snapshot = ratio.detach().cpu().numpy()
last_adv_snapshot = adv.detach().cpu().numpy()
last_clip_active_snapshot = clip_active.detach().cpu().numpy().astype(bool)
logs.append({"update": update, "timesteps": update * ROLLOUT_STEPS, **metrics, "episodes": len(episode_rewards)})
if update % LOG_EVERY_UPDATES == 0:
recent = episode_rewards[-10:]
recent_mean = float(np.mean(recent)) if recent else float("nan")
print(
f"update {update:>3}/{N_UPDATES} | "
f"episodes={len(episode_rewards):>4} | "
f"recent_reward_mean(10)={recent_mean:>7.2f} | "
f"clip_frac={metrics['clip_frac']:.3f} | "
f"approx_kl={metrics['approx_kl']:.4f}"
)
update 1/40 | episodes= 21 | recent_reward_mean(10)= 22.40 | clip_frac=0.000 | approx_kl=0.0000
update 2/40 | episodes= 44 | recent_reward_mean(10)= 27.20 | clip_frac=0.000 | approx_kl=0.0002
update 3/40 | episodes= 68 | recent_reward_mean(10)= 18.70 | clip_frac=0.000 | approx_kl=0.0020
update 4/40 | episodes= 85 | recent_reward_mean(10)= 33.00 | clip_frac=0.000 | approx_kl=-0.0003
update 5/40 | episodes= 105 | recent_reward_mean(10)= 31.20 | clip_frac=0.000 | approx_kl=-0.0000
update 6/40 | episodes= 123 | recent_reward_mean(10)= 31.90 | clip_frac=0.000 | approx_kl=-0.0002
update 7/40 | episodes= 142 | recent_reward_mean(10)= 27.60 | clip_frac=0.000 | approx_kl=0.0006
update 8/40 | episodes= 159 | recent_reward_mean(10)= 35.70 | clip_frac=0.000 | approx_kl=0.0001
update 9/40 | episodes= 179 | recent_reward_mean(10)= 23.50 | clip_frac=0.000 | approx_kl=-0.0000
update 10/40 | episodes= 196 | recent_reward_mean(10)= 20.40 | clip_frac=0.000 | approx_kl=0.0001
update 11/40 | episodes= 212 | recent_reward_mean(10)= 35.00 | clip_frac=0.000 | approx_kl=0.0001
update 12/40 | episodes= 229 | recent_reward_mean(10)= 31.80 | clip_frac=0.000 | approx_kl=-0.0005
update 13/40 | episodes= 242 | recent_reward_mean(10)= 39.50 | clip_frac=0.000 | approx_kl=0.0001
update 14/40 | episodes= 255 | recent_reward_mean(10)= 45.10 | clip_frac=0.000 | approx_kl=0.0008
update 15/40 | episodes= 268 | recent_reward_mean(10)= 43.00 | clip_frac=0.000 | approx_kl=0.0001
update 16/40 | episodes= 285 | recent_reward_mean(10)= 26.40 | clip_frac=0.000 | approx_kl=0.0067
update 17/40 | episodes= 300 | recent_reward_mean(10)= 37.20 | clip_frac=0.000 | approx_kl=0.0007
update 18/40 | episodes= 316 | recent_reward_mean(10)= 33.90 | clip_frac=0.002 | approx_kl=0.0029
update 19/40 | episodes= 336 | recent_reward_mean(10)= 26.90 | clip_frac=0.000 | approx_kl=-0.0001
update 20/40 | episodes= 353 | recent_reward_mean(10)= 29.70 | clip_frac=0.000 | approx_kl=0.0003
update 21/40 | episodes= 367 | recent_reward_mean(10)= 34.90 | clip_frac=0.000 | approx_kl=0.0001
update 22/40 | episodes= 387 | recent_reward_mean(10)= 27.10 | clip_frac=0.028 | approx_kl=0.0119
update 23/40 | episodes= 406 | recent_reward_mean(10)= 23.80 | clip_frac=0.089 | approx_kl=0.0049
update 24/40 | episodes= 424 | recent_reward_mean(10)= 24.30 | clip_frac=0.006 | approx_kl=0.0045
update 25/40 | episodes= 443 | recent_reward_mean(10)= 28.40 | clip_frac=0.015 | approx_kl=0.0029
update 26/40 | episodes= 459 | recent_reward_mean(10)= 30.90 | clip_frac=0.000 | approx_kl=0.0003
update 27/40 | episodes= 478 | recent_reward_mean(10)= 25.30 | clip_frac=0.021 | approx_kl=0.0051
update 28/40 | episodes= 495 | recent_reward_mean(10)= 36.30 | clip_frac=0.001 | approx_kl=0.0019
update 29/40 | episodes= 510 | recent_reward_mean(10)= 35.20 | clip_frac=0.002 | approx_kl=0.0024
update 30/40 | episodes= 521 | recent_reward_mean(10)= 43.80 | clip_frac=0.006 | approx_kl=0.0060
update 31/40 | episodes= 538 | recent_reward_mean(10)= 30.40 | clip_frac=0.049 | approx_kl=0.0050
update 32/40 | episodes= 556 | recent_reward_mean(10)= 29.40 | clip_frac=0.010 | approx_kl=0.0015
update 33/40 | episodes= 566 | recent_reward_mean(10)= 47.20 | clip_frac=0.002 | approx_kl=0.0062
update 34/40 | episodes= 586 | recent_reward_mean(10)= 23.00 | clip_frac=0.003 | approx_kl=0.0007
update 35/40 | episodes= 595 | recent_reward_mean(10)= 52.80 | clip_frac=0.001 | approx_kl=-0.0004
update 36/40 | episodes= 606 | recent_reward_mean(10)= 49.00 | clip_frac=0.002 | approx_kl=0.0068
update 37/40 | episodes= 621 | recent_reward_mean(10)= 31.80 | clip_frac=0.095 | approx_kl=0.0067
update 38/40 | episodes= 632 | recent_reward_mean(10)= 44.30 | clip_frac=0.000 | approx_kl=0.0029
update 39/40 | episodes= 646 | recent_reward_mean(10)= 32.90 | clip_frac=0.000 | approx_kl=0.0005
update 40/40 | episodes= 657 | recent_reward_mean(10)= 46.00 | clip_frac=0.001 | approx_kl=0.0036
df_logs = pd.DataFrame(logs)
df_logs.head()
| | update | timesteps | policy_loss | value_loss | entropy | clip_frac | approx_kl | episodes |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 512 | -0.003087 | 89.048697 | 0.692916 | 0.0 | 0.000009 | 21 |
| 1 | 2 | 1024 | -0.001868 | 88.419497 | 0.691513 | 0.0 | 0.000157 | 44 |
| 2 | 3 | 1536 | -0.004179 | 78.124797 | 0.688415 | 0.0 | 0.002013 | 68 |
| 3 | 4 | 2048 | -0.001415 | 110.220530 | 0.686069 | 0.0 | -0.000294 | 85 |
| 4 | 5 | 2560 | -0.001063 | 96.217341 | 0.683129 | 0.0 | -0.000024 | 105 |
7) Plotly: reward per episode (learning curve)#
This is the most direct signal for whether the policy is improving.
df_ep = pd.DataFrame({"episode": np.arange(len(episode_rewards)), "reward": episode_rewards})
window = 20
if len(df_ep) >= window:
df_ep["reward_ma"] = df_ep["reward"].rolling(window).mean()
fig = px.line(df_ep, x="episode", y="reward", title="CartPole reward per episode")
if "reward_ma" in df_ep.columns:
fig.add_trace(
go.Scatter(x=df_ep["episode"], y=df_ep["reward_ma"], name=f"MA({window})")
)
fig.update_layout(xaxis_title="Episode", yaxis_title="Total reward")
fig.show()
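Since the notebook is meant to be offline-friendly, the figure can also be saved as a standalone HTML file (an optional sketch; the file name is an arbitrary choice):
fig.write_html("reward_per_episode.html")  # self-contained HTML, viewable without a running kernel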
8) Plotly: PPO diagnostics over updates#
We visualize clipping behavior and losses over training updates.
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_logs["update"], y=df_logs["clip_frac"], name="clip_frac"))
fig.add_trace(go.Scatter(x=df_logs["update"], y=df_logs["approx_kl"], name="approx_kl"))
fig.update_layout(title="PPO diagnostics", xaxis_title="Update", yaxis_title="Value")
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_logs["update"], y=df_logs["policy_loss"], name="policy_loss"))
fig.add_trace(go.Scatter(x=df_logs["update"], y=df_logs["value_loss"], name="value_loss"))
fig.add_trace(go.Scatter(x=df_logs["update"], y=df_logs["entropy"], name="entropy"))
fig.update_layout(title="Losses over updates", xaxis_title="Update", yaxis_title="Loss / entropy")
fig.show()
9) Plotly: policy ratios \(r_t\) and clipping#
Below we plot the distribution of \(r_t\) and highlight where clipping is active.
The histogram should concentrate near 1.0.
As training progresses, some mass moves outside \([1-\epsilon, 1+\epsilon]\), but PPO discourages large deviations.
ratios = last_ratio_snapshot
fig = go.Figure()
fig.add_trace(go.Histogram(x=ratios, nbinsx=60, name="r_t"))
fig.add_vline(x=1.0 - CLIP_EPS, line_dash="dash", line_color="orange")
fig.add_vline(x=1.0, line_dash="dash", line_color="gray")
fig.add_vline(x=1.0 + CLIP_EPS, line_dash="dash", line_color="orange")
fig.update_layout(
title="Policy ratio distribution (last rollout)",
xaxis_title="r_t = pi_new(a|s) / pi_old(a|s)",
yaxis_title="Count",
)
fig.show()
df_ratio = pd.DataFrame(
{
"ratio": last_ratio_snapshot,
"advantage": last_adv_snapshot,
"clip_active": last_clip_active_snapshot,
}
)
fig = px.scatter(
df_ratio,
x="ratio",
y="advantage",
color="clip_active",
title="Where clipping is active (last rollout)",
labels={"ratio": "r_t", "advantage": "A_t"},
)
fig.add_vline(x=1.0 - CLIP_EPS, line_dash="dash", line_color="orange")
fig.add_vline(x=1.0, line_dash="dash", line_color="gray")
fig.add_vline(x=1.0 + CLIP_EPS, line_dash="dash", line_color="orange")
fig.show()
10) Stable-Baselines PPO1 (web research)#
A Stable-Baselines implementation of PPO1 exists (legacy TensorFlow 1.x codebase):
Repo: https://github.com/hill-a/stable-baselines
PPO1 package: https://github.com/hill-a/stable-baselines/tree/master/stable_baselines/ppo1
Main file: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/ppo1/pposgd_simple.py
Exposes class PPO1(...) (imported by stable_baselines/ppo1/__init__.py)
The original OpenAI Baselines PPO implementation is also available:
https://github.com/openai/baselines/tree/master/baselines/ppo1
Example usage (not run here):
from stable_baselines import PPO1
import gym
env = gym.make("CartPole-v1")
model = PPO1("MlpPolicy", env, clip_param=0.2, timesteps_per_actorbatch=2048)
model.learn(total_timesteps=1_000_000)
Note: Stable-Baselines is archived/legacy and uses TF1/MPI; Stable-Baselines3 is PyTorch and offers PPO (conceptually closer to PPO2-style implementations).
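For comparison, a roughly equivalent setup in Stable-Baselines3 looks like the sketch below (not run here; the parameter names are SB3's PPO arguments, not PPO1's):
from stable_baselines3 import PPO
model = PPO(
    "MlpPolicy",
    "CartPole-v1",
    n_steps=2048,      # rollout length per update (PPO1: timesteps_per_actorbatch)
    batch_size=64,     # minibatch size (PPO1: optim_batchsize)
    n_epochs=4,        # epochs per update (PPO1: optim_epochs)
    clip_range=0.2,    # clip parameter (PPO1: clip_param)
    gae_lambda=0.95,
    gamma=0.99,
)
model.learn(total_timesteps=1_000_000)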
11) Stable-Baselines PPO1 hyperparameters (explained)#
Stable-Baselines PPO1 (legacy TensorFlow/MPI) exposes the following constructor signature (from stable_baselines/ppo1/pposgd_simple.py):
PPO1(
policy,
env,
gamma=0.99,
timesteps_per_actorbatch=256,
clip_param=0.2,
entcoeff=0.01,
optim_epochs=4,
optim_stepsize=1e-3,
optim_batchsize=64,
lam=0.95,
adam_epsilon=1e-5,
schedule='linear',
verbose=0,
tensorboard_log=None,
_init_setup_model=True,
policy_kwargs=None,
full_tensorboard_log=False,
seed=None,
n_cpu_tf_sess=1,
)
What each hyperparameter does#
policy: policy class (or registered string) like MlpPolicy, CnnPolicy, etc.
env: Gym env instance or an env id string (e.g. 'CartPole-v1').
gamma: discount factor \(\gamma\).
timesteps_per_actorbatch: number of environment steps collected per update per actor (batch size).
clip_param: PPO clip parameter \(\epsilon\).
entcoeff: entropy coefficient (larger → more exploration pressure).
optim_epochs: number of epochs over the on-policy batch per update.
optim_stepsize: optimizer step size (learning rate), optionally controlled by schedule.
optim_batchsize: minibatch size.
lam: GAE(\(\lambda\)) parameter.
adam_epsilon: Adam epsilon for numerical stability.
schedule: learning-rate schedule type (e.g. 'linear', 'constant', …).
Mapping to this notebook#
SB timesteps_per_actorbatch → this notebook's ROLLOUT_STEPS
SB clip_param → CLIP_EPS
SB entcoeff → ENT_COEF
SB optim_epochs → UPDATE_EPOCHS
SB optim_stepsize → LEARNING_RATE
SB optim_batchsize → MINIBATCH_SIZE
SB gamma → GAMMA
SB lam → GAE_LAMBDA
SB adam_epsilon → ADAM_EPS
SB schedule → not implemented here (easy extension: linearly decay LEARNING_RATE over updates; see the sketch below)
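A minimal sketch of that extension, assuming the training loop of section 6 (the helper name set_linear_lr is ours, not from Stable-Baselines):
def set_linear_lr(optimizer, update: int, n_updates: int, base_lr: float = LEARNING_RATE):
    """Linearly decay the learning rate from base_lr toward 0 over n_updates."""
    frac = 1.0 - (update - 1) / n_updates  # 1.0 on the first update, ~0 on the last
    for group in optimizer.param_groups:
        group["lr"] = base_lr * frac
# Usage: call set_linear_lr(optimizer, update, N_UPDATES) at the top of each training update.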
Practical tuning hints#
If reward collapses: reduce LEARNING_RATE, reduce UPDATE_EPOCHS, or reduce CLIP_EPS.
If learning is slow: increase ROLLOUT_STEPS, increase UPDATE_EPOCHS, or slightly increase LEARNING_RATE.
Watch approx_kl and clip_frac: sustained high values mean policy updates are too aggressive (see the sketch after this list).
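One common guardrail for that last point (borrowed from PPO2/SB3-style implementations, not part of the code above) is an early stop on the approximate KL. A commented sketch against ppo_update:
# Hypothetical extension of ppo_update; TARGET_KL is an illustrative knob, not defined above.
# TARGET_KL = 0.02
# for _ in range(UPDATE_EPOCHS):
#     ... minibatch loop as above, appending to approx_kls ...
#     if np.mean(approx_kls) > TARGET_KL:
#         break  # stop early: the new policy has drifted far enough from the old one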
Pitfalls + exercises#
If training is unstable: lower LEARNING_RATE, check advantage normalization, and verify the done/bootstrapping logic (see the note below).
If clip_frac is near 0.0: updates may be too small (try a higher LR or more epochs).
If clip_frac is very high: updates are too aggressive (try a smaller LR or a smaller CLIP_EPS).
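On the done/bootstrapping point: this notebook sets done = terminated or truncated, so time-limit truncations (CartPole-v1 truncates at 500 steps) are treated as true terminations and bootstrapped with 0. A hedged sketch of the usual correction, folding the bootstrap value into the reward of a truncated step inside collect_rollout:
# Sketch (not applied above): bootstrap truncated steps instead of treating them as terminal.
# if truncated and not terminated:
#     with torch.no_grad():
#         next_obs_t = torch.tensor(next_obs, dtype=torch.float32, device=device).unsqueeze(0)
#         _, boot_value = agent.forward(next_obs_t)
#     reward = float(reward) + GAMMA * float(boot_value.item())  # fold V(s_{t+1}) into the reward
# compute_gae stays unchanged: `done` still masks the next value.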
Exercises#
Add an entropy bonus (ENT_COEF > 0) and compare learning curves.
Implement value function clipping (as in some PPO variants) and compare critic stability.
Switch to a continuous-action env (e.g., Pendulum) using a Gaussian policy (see the sketch below).
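For the continuous-action exercise, a minimal sketch of a diagonal-Gaussian actor-critic (a starting point under our own naming, not code used in this notebook; the clipped-surrogate update stays the same, only the distribution changes):
class GaussianActorCritic(nn.Module):
    """Sketch for continuous actions (e.g. Pendulum-v1)."""
    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # state-independent log std
        self.value_head = nn.Linear(hidden, 1)
    def forward(self, obs):
        x = self.backbone(obs)
        return self.mean_head(x), self.value_head(x).squeeze(-1)
    def evaluate_actions(self, obs, actions):
        mean, value = self.forward(obs)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        log_prob = dist.log_prob(actions).sum(-1)  # sum log-probs over action dimensions
        entropy = dist.entropy().sum(-1)
        return log_prob, entropy, value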
References#
Schulman et al., Proximal Policy Optimization Algorithms (2017): https://arxiv.org/abs/1707.06347
Stable-Baselines PPO1 source (TF1): https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/ppo1/pposgd_simple.py